



# Massively Parallel Algorithms Dense Matrix Algorithms

G. Zachmann
University of Bremen, Germany
cgvr.cs.uni-bremen.de



# Warming Up: Matrix-Vector Product



Given matrix A, and vector x, compute

$$y = Ax$$

- One of the most important operations in linear algebra algorithms
  - Called SGEMV in BLAS (Basic Linear Algebra Subroutines)
- First approach: one thread per row



• Observation: all threads use the same data from  $x \rightarrow$  shared memory



#### Algorithm for First Attempt



```
multMatrixVector( const float * A, const float * x,
                   const int n columns, float * y )
     shared x cache[ THREADS PER BLOCK ];
   vi = 0.0;
                                        // output of each thread
   int i = threadIdx.x + blockIdx.x * blockDim.x; // row index
   for ( int j = 0; j < n columns; j += THREADS PER BLOCK )</pre>
      // new segment of columns -> fill cache
      x cache[threadIdx.x] = x[ j + threadIdx.x ];
      // now process this segment of columns
      for ( int k = 0; k < THREADS PER BLOCK; k ++ ) {</pre>
         Aij = A[ i*n columns + j+k ];
         yi += Aij*x cache[k];
                                      Block of
                                      threads
                                                                   Block-
   y[i] = yi;
                                                                   size
                                                            *
```

For sake of clarity, we assume M, N = multiple of block-size

**Blocksize** 





- The "natural" (C) way to store matrices is called row major order
  - $A_{ij}$  is stored at memory address  $\mathbf{A} + \mathbf{i} * \mathbf{n} + \mathbf{j}$
- For a conventional (sequential) matrix-vector-multiplication algorithm, this is good:

| <del></del> |    |    |    |  |  |
|-------------|----|----|----|--|--|
| 0           | 1  | 2  | 3  |  |  |
| 4           | 5  | 6  | 7  |  |  |
| 8           | 9  | 10 | 11 |  |  |
| 12          | 13 | 14 | 15 |  |  |
| 16          | 17 | 18 | 19 |  |  |

```
for ( int i = 0; i < M; i ++ ) {
   float yi = 0.0;
   for ( int j = 0; j < N; j ++ )
      yi += A[i][j] * x[j];
   y[i] = yi;
```



May 2014



# 2D Array Access Patterns (row major vs column major)



Consider the following piece in a kernel (e.g., matrix × vector):

```
for ( int j = 0; j < blockDim.x; j ++ ) {
   float Aij = A[treadIdx.x][j];
   ... do something with it ...</pre>
```



- Problem: uncoalesced access pattern
  - Elements read on 1<sup>st</sup> SIMT access: 0, 32, 64, ...
  - Elements read on 2<sup>nd</sup> SIMT access: 1, 33, 65, ...
  - Also, extra data will be transferred in order to fill the cache line size
- Generally, most natural access pattern for direct port of a C/C++ code!



#### Transposed 2D Array Access Pattern



- Column major := store a logical row in a physical column
  - I.e.,  $A_{00} \to A[0][0]$ ,  $A_{01} \to A[1][0]$ ,  $A_{02} \to A[2][0]$ , ...  $A_{10} \to A[0][1]$ ,  $A_{11} \to A[1][1]$ ,  $A_{12} \to A[2][1]$ , ...  $A_{20} \to A[0][2]$ , ...

|              | 0 | 5 | 10 | 15 |
|--------------|---|---|----|----|
|              | 1 | 6 | 11 | 16 |
| $\downarrow$ | 2 | 7 | 12 | 17 |
|              | 3 | 8 | 13 | 18 |
|              | 4 | 9 | 14 | 19 |

- In general:  $A_{ij}$  is stored at  $\mathbf{A} + \mathbf{i} + \mathbf{j} * \mathbf{n}$
- Transform the code to column major:

```
for ( int j = 0; j < blockDim.x; j ++ ) {
   float Aij = A[j][treadIdx.x];
   ... do something with it ...</pre>
```

- Now, we have coalesced accesses:
  - Elements read on 1<sup>st</sup> SIMT access: 0, 1, 2, ..., 31
  - Elements read on 2<sup>nd</sup> SIMT access: 32, 33, ..., 63





#### Modified Matrix\*Vector Algorithm for Column-Major Matrix Storage



```
multMatrixVector( const float * A, const float * x,
                   const int n columns , float * y )
     shared x cache[ THREADS PER BLOCK ];
   vi = 0.0;
                                          // output of each thread
   int i = threadIdx.x + blockIdx.x * blockDim.x; // row index
   for ( int j = 0; j < n columns; j += THREADS PER BLOCK )</pre>
      // new segment of columns -> fill cache
      x cache[threadIdx.x] = x[ j + threadIdx.x ];
      // now process this segment of columns
      for ( int k = 0; k < THREADS PER BLOCK; k ++ ) {</pre>
         Aij = A[i + (j+k)*n columns];
         yi += Aij * x cache[k];
                                 Note: n columns is still the
   y[i] = yi;
                                 number of columns of the logical matrix,
                                 not the number of columns of the physical matrix!
```





- Note: from now on, we will use row-major notation (just for sake of clarity)!
  - But we will assume that an actual implementation uses column-major!
  - We expect you to transform everything to column-major
  - Start with small matrices that you can check "by hand"
  - Or implement your code first on the CPU and test it there





- Do we keep all hardware resources of the GPU busy?
- Assume Fermi [2011] hardware:
  - 14 SMs, each supports 1536 active threads
  - If  $M < 21504 = 14 \times 1536 \rightarrow \text{some SMs are idle!}$
- Idea for the case M < 21504 and N "not too small":</p>
  - Use 2D partitioning of our problem/domain







- All possible domain decomposition variants:
  - 1. One thread per row
  - 2. Several threads per row (previous slide)
  - 3. Several rows per thread (one thread computes several y[i]'s at the same time)
  - 4. Several threads, several rows (version 2 & 3 combined)
- Which version is best in which case? (YMMV)







#### Computational performance that can be achieved [2011]:



Performance of matrix-vector multiplication (SGEMV) over matrices of size  $m \times n$ 

["Fast High-performance Modeling Tools for Many-core Architectures ", Glimberg et al., 2011]



#### **Complexities**



- Sequential version:  $O(n^2)$  (assuming n=m)
- Parallel version: O(n) parallel time
  - Assuming *n* parallel threads
- Arithmetic intensity:
  - Assume following simplified version:

- Number of slow memory references =  $f = 2n + n^2$
- Number of arithmetic operations =  $o = 2n^2$
- Arithmetic intensity  $a = \frac{o}{f} \approx 2 \rightarrow$  memory limited





- Remark: actually, SGEMV in BLAS computes  $\mathbf{y} = \alpha A \mathbf{x} + \beta \mathbf{y}$ 
  - Should be fairly straight-forward to modify our kernels



## Matrix-Matrix Multiplication



- Called SGEMM in BLAS
- Given matrices A and B, compute  $P = A \cdot B$
- For sake of simplicity, we'll assume
   A and B are square matrices of size n×n
- Sequential algorithm:

```
for i = 1 ... n:
  for j = 1 ... n:
    s = 0.0
  for k = 1 ... n:
    s += A[i][k] * B[k][j]
  P[i][j] = s
```



B





- Complexity:  $O(n^3)$
- Arithmetic intensity:  $a = \frac{2n^3}{2n^3 + n^2} \approx 1$

```
for i = 1 ... n:
  for j = 1 ... n:
    s = 0
    for k = 1 ... n:
     s += A[i][k] * B[k][j]
    P[i][j] = s
```

- Even worse than matrix-vector mult.!
- Upper bound, w/o proof, at least with iterative = non-recursive algorithms:

$$\hat{a} = \frac{2n^3}{3n^2} \in O(n)$$

Problem: no data re-use!



#### Naïve Parallel Matrix Multiplication



- Approach:
  - Use matrix-vector-multiplication idea
  - Run one thread per row of A:

```
for j = 1 ... n:
    read column j of B into fast memory (B_cache)
    foreach i = 1 ... n run one thread in parallel:
        s = 0.0
        for k = 1 ... n:
        s += A[i][k] * B_cache[k][j]
        P[i][j] = s
```

Arithmetic intensity:

$$a=\frac{2n^3}{n^2+n^3}\approx 2$$

■ Not much better ⊗



B



## Blocked (Tiled) Matrix Multiplication



Remember linear algebra class: the procedure

$$p_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$$

works also for sub-blocks of the matrices

$$P_{ij} = \sum_{k=1}^{n/m} A_{ik} B_{kj}$$





- In production code, you'd have to cope with any matrix size!
  - Lots of nitty-gritty details ...











- New approach (2D partitioning):
  - For each sub-matrix  $P_{ij}$ , run one block of  $m^2$  threads
  - Each thread in the block computes one p<sub>ij</sub>
  - The kernel runs in phases
- Each phase consists of:
  - 1. Load blocks  $A_{ik}$ ,  $B_{kj}$  into shared memory
    - Each thread loads one  $a_{ij}$ , one  $b_{ij}$
  - 2. Perform "row × column" over block
  - 3. Accumulate partial results









#### Pseudo code:

```
dim3 threadsPerBlock(m,m);
dim3 n_blocks( n/m, n/m );
multMatrices<<< n_blocks, threadsPerBlock >>>( A, B, P, n );
```





- Previous optimization is called blocking/tiling (copy optimization)
- How should matrices A and B be stored?
  - Remember: at the beginning of each phase: each thread loads one  $a_{ij}$  & one  $b_{ij}$
- Store matrices in blocked form, in order to achieve coalesced memory access:

Original matrix (numbers are addresses)

|          | 0 | 4 | 8  | 12 |
|----------|---|---|----|----|
|          | 1 | 5 | 9  | 13 |
| <b>↓</b> | 2 | 6 | 10 | 14 |
|          | 3 | 7 | 11 | 15 |

Reorganized into blocks

| 0 | 2 | 8  | 10 |
|---|---|----|----|
| 1 | 3 | 9  | 11 |
| 4 | 6 | 12 | 14 |
| 5 | 7 | 13 | 15 |





- Arithmetic intensity:
  - P consists of b<sup>2</sup> blocks
  - For each block  $P_{ij}$ , we load b blocks of A and b blocks of B
  - Overall, our algorithm loads 2b<sup>3</sup> many blocks
  - One block load =  $m^2$  float loads
  - $b = \frac{n}{m}$
  - Overall, our algorithm loads  $2\left(\frac{n}{m}\right)^3m^2=2\frac{n^3}{m}$  many floats
  - Therefore,  $a = \frac{2n^3}{2\frac{n^3}{m}} = m$
- Consequence: make m large
- Limit: all three blocks  $P_{ij}$ ,  $A_{ik}$ ,  $B_{kj}$ , must fit in shared memory





#### Calculating m:

- Assume Kepler-GPU: ~ 2 TFlops/sec = 2·10<sup>12</sup> FLOPs/sec ,
   ~ 200 GB/sec = 200·10<sup>9</sup> B/sec
- Choose m such that we achieve peak bandwidth & peak FLOPs/sec

$$m = a = \frac{\text{\# FLops}}{\text{\# Loads}} = \frac{\text{\# Flops/sec}}{\text{\# Loads/sec}} = \frac{2 \cdot 10^{12} \text{ Flops/sec}}{\frac{200}{4} \cdot 10^9 \text{ B/sec}} = 40$$

$$1 \text{ Load} = 4 \text{ Bytes}$$

- Note: these are very crude estimations, but good for a starting point where to search for the sweet spot
- Consequence: size of shared memory should be at least  $3 \cdot 40^2 \cdot 4$  Bytes = 19.2 kB
  - Otherwise, we would be bandwidth limited



#### Effects of Block Size







### Comparison with MKL (Intel) [2001]





[ http://www.scribd.com/doc/47501296/CUDA-3-2-Math-Libraries-Performance ]



#### **Limitations / Optimality**



- Tiling/blocking only works, if the arithm. operation is associative
- Arithmetic intensity, a, is bounded by size of shared memory, S:

$$a \approx m \leq \sqrt{\frac{S}{3}}$$

- Our algorithm performs  $O(\frac{n^3}{\sqrt{S}})$  many load operations
- Note: in a sense, our blocked matrix multiplication algorithm is a way to schedule memory transfers and floating point operations
- Theorem (Hong & Kung, 1981; w/o proof): Any schedule of conventional matrix multiplication must transfer  $O(\frac{n^3}{\sqrt{S}})$  many floats between slow and fast memory.
- In this sense, blocked matrix multiplication is optimal



## Strassen's Algorithm





- All "traditional" algorithms need  $O(n^3)$  FLOPs
- Strassen's algorithm:  $O(n^{2.81})$ 
  - Recursive algorithm!
  - See 2<sup>nd</sup> semester's course "algorithms and data structures"
- Current world record:  $O(n^{2.376})$
- Strassen on the GPU?
  - Probably not worth it (recursion / complex control flow)

Matrix Algorithms



# Recap: Strassen's Algorithm Optional



- Task: compute  $C = A \cdot B$ ,  $A, B \in \mathbb{R}^{n \times n}$
- Idea : divide-and-conquer
  - Partition A, B, C in 2x2 block matrices

$$\begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \cdot \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$$
 mit  $a_{ii}, b_{ii}, c_{ii} \in \mathbb{R}^{\frac{n}{2} \times \frac{n}{2}}$ 

Multiplication gives:

$$c_{11} = a_{11}b_{11} + a_{12}b_{21}$$
  
:  
 $c_{22} = a_{21}b_{11} + a_{22}b_{22}$ 

• Which amounts to 8 matrix multiplications of size  $\frac{n}{2} \times \frac{n}{2}$ 







The trick: compute some (seemingly tedious) intermediate products

$$Q_1 \equiv (a_{11} + a_{22})(b_{11} + b_{22})$$
 $Q_2 \equiv (a_{21} + a_{22})b_{11}$ 
 $Q_3 \equiv a_{11}(b_{12} - b_{22})$ 
 $Q_4 \equiv a_{22}(-b_{11} + b_{21})$ 
 $Q_5 \equiv (a_{11} + a_{12})b_{22}$ 
 $Q_6 \equiv (-a_{11} + a_{21})(b_{11} + b_{12})$ 
 $Q_7 \equiv (a_{12} - a_{22})(b_{21} + b_{22})$ 

• Now we can compute the  $c_{ij}$ 's like so:

$$c_{11} = Q_1 + Q_4 - Q_5 + Q_7$$

$$c_{12} = Q_2 + Q_4$$

$$c_{21} = Q_3 + Q_5$$

$$c_{22} = Q_1 + Q_3 - Q_2 + Q_6$$



# **Optional**



Computational complexity:

$$T(n) = 7T\left(\frac{n}{2}\right) + cn^2 \in O(n^{2.8...})$$

- Assumption here: multiplications are the expensive operation
- How would this perform on a GPU?



## Application: All Pairs Shortest Paths (APSP)



• Given: directed graph G = (V, E) and a distance function

$$\mathsf{dist}: E \to \mathbb{R}$$

where V = set of all vertices (nodes), |V| = n, and E = set of edges

- Goal: compute  $n \times n$  matrix  $D = d_{ij}$  that stores for each pair  $(v_i, v_j)$  the shortest path from  $v_i$  to  $v_j$  in graph G
- Example:



|   | 1 | 2  | 3  | 4 | 5  |
|---|---|----|----|---|----|
| 1 | 0 | 3  | 8  | 4 | 4  |
| 2 | 3 | 0  | 6  | 1 | 7  |
| 3 | 7 | 4  | 0  | 5 | 11 |
| 4 | 2 | 5  | 5  | 0 | 6  |
| 5 | 8 | 11 | 11 | 6 | 0  |
|   |   |    |    |   |    |

Shortest path matrix D



#### The Adjacency Matrix Representation of Directed Graphs



- The adjacency matrix A represents the distance function dist
- A is an  $n \times n$  matrix  $A = (\delta_{ij})$  where

$$\delta_{ij} = \begin{cases} \operatorname{dist}(v_i, v_j), & \text{if } (v_i, v_j) \in E \\ \infty, & \text{if } (v_i, v_j) \notin E \\ 0, & \text{if } i = j \end{cases}$$

#### Example:



|   | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 0 | 3 | 8 | 8 | 4 |
| 2 | 8 | 0 | 8 | 1 | 7 |
| 3 | 8 | 4 | 0 | 8 | 8 |
| 4 | 2 | 8 | 5 | 0 | 8 |
| 5 | 8 | 8 | 8 | 6 | 0 |

Adjacency matrix



#### The Shortest Paths Property



- We will now extend the simple, edge-based distance function to a distance function dist' on paths
- Define

$$\operatorname{dist'}(p_{ij}^1) = egin{cases} 0, & i = j \ \delta_{ij}, & i 
eq j \end{cases}$$

- Consider a shortest path  $p_{ij}^k$  from  $v_i$  to  $v_j$  such that  $|p_{ij}^k| \le k$ , i.e.,  $p_{ij}^k$  can have most k edges
  - Let  $(v_l, v_j)$  be the last edge of path  $p^{k_{ij}}$
  - Then, there must be a *shortest* path  $p_{il}^{k-1}$  from  $v_i$  to  $v_l$  (optimal substructure!)
- Therefore,

$$\mathsf{dist'}(p_{ij}^k) = \mathsf{dist'}(p_{il}^{k-1}) + \delta_{lj}$$



#### A Simple Algorithm for APSP



- Given the adjacency matrix A, compute a series of matrices  $D^1=A$ ,  $D^2$ , ...,  $D^{n-2}$ ,  $D^{n-1}$  where matrix  $D^k=\operatorname{dist}'(p_{ij}^k)$  contains lengths of shortest paths in G with at most k edges
- Example:



|   | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 0 | 3 | 8 | 8 | 4 |
| 2 | 8 | 0 | 8 | 1 | 7 |
| 3 | 8 | 4 | 0 | 8 | 8 |
| 4 | 2 | 8 | 5 | 0 | 8 |
| 5 | 8 | 8 | 8 | 6 | 0 |
|   |   |   |   |   |   |

Adjacency matrix

|                       | 1 | 2 | 3  | 4 | 5  |
|-----------------------|---|---|----|---|----|
| 1                     | 0 | 3 | 8  | 4 | 4  |
| 2                     | 3 | 0 | 6  | 1 | 7  |
| 3                     | 8 | 4 | 0  | 5 | 11 |
| 4                     | 2 | 5 | 5  | 0 | 6  |
| 5                     | 8 | 8 | 11 | 6 | 0  |
| Matrix D <sup>2</sup> |   |   |    |   |    |

• Final matrix  $D^{n-1}$  contains the actual shortest paths in G





The algorithm:

```
A = adjacency matrix D^1 = A for k = 2 to n-1: D^k = ExtendPaths(D^{k-1}, A) return D^k
```

```
\begin{split} & \underline{\text{ExtendPaths}(\ D,\ A\ )} \\ & E = e_{ij} \text{ is an } n \times n \text{ distance matrix} \\ & \text{for } i = 1 \text{ to } n \text{:} \\ & \text{for } j = 1 \text{ to } n \text{:} \\ & e_{ij} = d_{ij} \\ & \text{for } k = 1 \text{ to } n \text{:} \\ & e_{ij} = \min\{e_{ij},\ d_{ik} + \delta_{kj}\} \\ & \text{return } D \end{split}
```

```
\begin{split} & \underline{\text{MatrixMultiply(B, A)}} \\ & C = c_{ij} \text{ is an nxn result matrix} \\ & \text{for i = 1 to n:} \\ & \text{for j = 1 to n:} \\ & c_{ij} = 0 \\ & \text{for k = 1 to n:} \\ & c_{ij} = c_{ij} + a_{ik} \cdot b_{kj} \\ & \text{return C} \end{split}
```

- Notice the similarity with matrix multiplication!
  - We can adapt our fast GPU-based matrix multiplication code to solve the APSP problem quite easily



# A Word on Sparse Matrices ptional



- Just some remarks
- Frequent case: sparse band matrices
  - Represent matrix as a number of vectors
  - Devise new parallel algorithm (one thread per row is inefficient)





# **Optional**



- "Unstructured" sparse matrices:
  - Most common storage format is Compressed Sparse Row (CSR)





# **Optional**



- Many more kinds of sparse matrices
  - Specialized representation / algorithms for each of them?





## Summary



- Simple performance models can aid in understanding
- Two ratios are key:
  - Arithmetic (computational) intensity =  $\frac{\# \text{ flops}}{\# \text{ mops}}$ 
    - "flops" = floating point operations, "mops" = memory operations
  - Machine balance =  $\frac{\text{Tflops/sec}}{\text{GB/sec}}$